refactor: adopt mixin chain + emit per-phase spans by timzsu · Pull Request #49 · mlsys-io/FlowMesh

timzsu · 2026-05-14T09:06:11Z

Purpose

The four omni executors predated the executor mixin chain (InferenceMixin → DataMixin → GovernanceMixin) — the last family still on a plain-Executor base. This PR moves them onto the chain so they emit OTel traces like the inference and training executors, adds three per-phase spans inside each run(), routes prompts through the mixin's data path, and uploads spans to the server's /traces endpoint so HTTP-destination workers don't strand their trace JSONL on the worker filesystem. RFC #48 omni item.

Changes

omni_executor_base.py: OmniExecutorBase inherits (InferenceMixin, Executor). The executor-local collect_text_inputs helper is removed; prompts now come from DataMixin._collect_prompts_for_spec, with each call site narrowing PromptInput → str inline and raising ExecutionError if any item is not a str (omni executors don't consume the chat-message form).
omni_text2{image,speech,audio,general}_executor.py: each run() wraps self._run_inner(...) in self._task_span(...), then calls maybe_upload_artifacts(...) and maybe_upload_traces(...) after the span exits — matches the vllm_executor / diffusers_executor shape. Doing the uploads after __exit__ is required so the root task span row is flushed into spans.jsonl before the trace upload reads it. Without maybe_upload_traces, omni span files stayed on the worker filesystem in HTTP-destination deployments and never reached the server's /api/v1/traces/workflows/{wfl}/spans endpoint.
Three SpanType.COMPUTE sub-spans inside the task span: model load (_ensure_omni), generation (the omni.generate loop / streaming generator), output postprocessing (artifact save loop). Attributes carry prompt_count, item_count, flowmesh.type=compute.
examples/templates/omni_text2{speech,audio,general}.yaml: migrated from data.text: "..." to the canonical data.type: list / items: [...] mixin shape.
tests/worker/test_omni_executor_inheritance.py: parametrized over the four executor classes + the base; asserts each is a subclass of InferenceMixin, DataMixin, GovernanceMixin. Catches future regressions of the base class hierarchy.

Test Plan

uv run pytest tests/worker/test_omni_executor_inheritance.py tests/server tests/shared tests/sdk tests/cli — 537 passed, mypy clean across the touched files.
Live e2e on one GPU worker, all four omni templates. For each: ok=True, expected artifact on disk (generated_tts.wav / generated_image_*.png / bgm.wav / narration.wav), spans include task, model load, generation, output postprocessing, prompts threaded through DataMixin._collect_prompts_for_spec.

Test Result

537 unit tests passed; mypy clean.

Live e2e (1 GPU worker, omni-mixins images):

Workflow	total (s)	model load	generation	output postprocessing	other
omni_text2speech	147.03	119.72s (81.4 %)	27.30s (18.6 %)	0.00s (0.0 %)	0.01s
omni_text2image	105.33	90.74s (86.1 %)	13.92s (13.2 %)	0.65s (0.6 %)	0.02s
omni_text2audio	31.52	29.59s (93.9 %)	1.92s ( 6.1 %)	0.00s (0.0 %)	0.01s
omni_text2general	451.74	450.79s (99.8 %)	0.92s ( 0.2 %)	0.03s (0.0 %)	0.01s

Cold weights dominate; generation second; postprocessing negligible. Sub-spans sum to within 20 ms of the root in every case. Image, audio, general re-validated on the post-mixin-migration commit; results match the table above. Trace-upload fix is behavior-only for HTTP-destination workers and doesn't change wall-clock; local-stack mode reads spans off the shared docker volume regardless.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed.

- OmniExecutorBase inherits (InferenceMixin, Executor) so the four omni executors pick up GovernanceMixin / DataMixin / InferenceMixin from the same chain the inference and training executors use. - Each concrete omni executor wraps run() with self._task_span(...) so a 'task' root span is emitted with executor.name + workflow_id. - Inside run(), three per-phase compute spans are added — 'model load' (_ensure_omni), 'generation' (the omni.generate call(s)), and 'output postprocessing' (artifact save + items build) — mirroring the vllm executor's tracing shape. - New tests/worker/test_omni_executor_inheritance.py asserts the full mixin chain on each omni executor class as a compile-time guard against regression. Live e2e on a single GPU worker against all four omni templates (omni_text2{speech,image,audio,general}.yaml): each task reports ok=True with the expected artifact, and spans.jsonl contains the 'task' root plus 'model load' / 'generation' / 'output postprocessing' sub-spans. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

…spec Replace the executor-local collect_text_inputs helper with the mixin's _collect_prompts_for_spec. Each omni executor now narrows PromptInput to str inline and raises ExecutionError if any item is not a string. Templates adopt the canonical data.type: list / items shape. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Move maybe_upload_artifacts out of the task span and add a missing maybe_upload_traces call right after, matching the vllm and diffusers pattern. Without the trace upload, omni span JSONL stayed on remote workers and never reached the server's /traces endpoint in HTTP mode. Also extract _run_inner in the image and speech executors so the post-span fall-through is the same shape across all four. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

Minor comments. Additional consideration:

Another cleanup you should consider is to move the run method from the Omni executor classes to OmniExecutorBase, define the _run_inner method as an abstract method whose spec parameter is of type TaskSpecStrictBase., and define a class attribute _TASK_SPEC_TYPE. In this way, OmniExecutorBase can call self.require_spec(task, self._TASK_SPEC_TYPE) inside the generic self.run.

A drawback of this approach is that you need to call assert isinstance(spec, <spec-type>) as the first line of every concrete _run_inner.

Each concrete omni executor's run() did the same five things: resolve spec, dump dict, normalize out_dir, run the task span, upload artifacts and traces. Move that boilerplate to OmniExecutorBase.run() and let subclasses contribute via a _TASK_SPEC_TYPE class attribute plus an abstract _run_inner whose first line is `assert isinstance(spec, ...)` to recover the concrete type. Also adopt the cast(list[str], raw_prompts) form for the prompt-string narrowing in all four executors so the pattern reads identically. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz

LGTM.

timzsu changed the title ~~refactor(omni): adopt mixin chain + emit per-phase spans (RFC #48)~~ refactor: adopt mixin chain + emit per-phase spans (RFC #48) May 14, 2026

timzsu added 2 commits May 14, 2026 10:08

timzsu marked this pull request as ready for review May 14, 2026 10:45

timzsu requested a review from kaiitunnz as a code owner May 14, 2026 10:45

timzsu mentioned this pull request May 14, 2026

[RFC]: FlowMesh 2026 Q2 Roadmap #48

Open

9 tasks

small patch

188e305

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

timzsu changed the title ~~refactor: adopt mixin chain + emit per-phase spans (RFC #48)~~ refactor: adopt mixin chain + emit per-phase spans May 14, 2026

timzsu force-pushed the zsu/rfc48-omni-mixins branch from 47dd136 to 188e305 Compare May 14, 2026 16:25

timzsu requested a review from J1shen May 15, 2026 05:05

kaiitunnz requested changes May 15, 2026

View reviewed changes

timzsu requested a review from kaiitunnz May 15, 2026 08:54

refactor: introduce _collect_text_inputs as a shared helper

21b9149

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

kaiitunnz approved these changes May 15, 2026

View reviewed changes

timzsu merged commit 08d9640 into main May 15, 2026
11 checks passed

timzsu deleted the zsu/rfc48-omni-mixins branch May 15, 2026 09:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: adopt mixin chain + emit per-phase spans#49

refactor: adopt mixin chain + emit per-phase spans#49
timzsu merged 6 commits into
mainfrom
zsu/rfc48-omni-mixins

timzsu commented May 14, 2026 •

edited

Loading

Uh oh!

kaiitunnz left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

timzsu commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Test Plan

Test Result

Uh oh!

kaiitunnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timzsu commented May 14, 2026 •

edited

Loading